-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent users from using document block APIs when sort is configured #12711
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, I believe that there is an alternative, that would consist of recording the block structure and preserving it through index sorting. That said, it looks like a lot of trouble and I'm unsure how frequently index sorting and blocks would be used together.
This change has the merit of making it clear that these two features don't play well together, so +1 to making this change, we can always allow combining index sorting and blocks later on.
Let's add a note to MIGRATE.txt and fix nightly benchmarks (e.g. https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/IndexTaxis.java) before merging?
This is what we do today: we're careful to add blocks of docs that sort together. What is the alternative going to be? Instead one should sequentially call addDocument()? I have a question about this; is indexing sorting stable? If I add multiple docs in a block that share the same sort key, will their order be preserved? I believe this is the case in our current index - so we need the guarantee, and the sorting. If that still works with sequential calls to addDocument, I guess that we can shift to doing that, but in that case I don't understand how preventing the use of this API will be helpful. It seems as if this is sort of an "advice to users" change that really doesn't alter anything substantive? Could we address it with documentation? |
I don't think we should make a hard block here. As @msokolov points out, if you are careful, so your static sort is congruent with your blocks, the blocks will be preserved. I agree it is trappy, if you don't use a congruent static sort. Can we somehow allow for expert usage to continue to use both of these together? Maybe there is some way to detect that the block of docs is congruent with the static sort? |
Would an expert API on the Sort work for you folks? Like a getter that indicates if it’s a stable sort and preserves blocks?
|
I spoke to @jpountz about this topic and we discussed a different approach. We could get away with not having the check at all and make blocks a first class citizen by recording the parent document in a docvalues field. Really, if we'd be implementing the feature today would we use a bitset or maybe a sparse DV field recording the number of children for each block in the index? I guess the latter... We could ask the user for a field to record this in IWC and make blocks a first class citizen. That way we can sort only on the parent documents which would make things for you (@mikemccand and @msokolov) and maybe other users simpler and more safe. You'd still have the power to do what you do today but we can fix documentation that your blocks might be messed up unless you set a field on IWC. Full BWC and less trouble for the users? |
The idea of more explicitly modeling doc-blocks makes sense to me, but I wonder if it would really enable "That way we can sort only on the parent document". What about the case (supported today) where we have nested blocks? I guess in that case we would have to figure out which parent docs are "top-level" and only sort by those? The idea of adding an isStable flag to the index sort also makes sense to me and would work if it's not possible to guarantee that in general for all index sorting as we do today. But again, I wonder if it makes things any easier to implement? |
Another question: do we have any testing around this sort-stability / block-preservation today? I'm getting nervous now that we are relying on an undocumented feature that just happens to work. EG I checked TestIndexSorting and it doesn't seem to call add/updateDocuments at all. |
I don't think we have any tests for this. Otherwise the build would have failed on this PR
Wait, correct me if I am wrong but we don't support nested blocks with an API. You can model that with certain fields on the docs in the block but there is no API. The purpose of my suggestion was to make the root block a first class citizen. @mikemccand @martijnvg any opinions on this suggestion from above: #12711 (comment) |
In fact, in order to make use of your doc blocks at search time ( But is the idea just to prevent incorrect usage (mixing non-congruent sort with doc blocks)? Or, is it to fix whatever sort the user provides so that doc blocks will work correctly (i.e. whole doc blocks get sorted atomically, order preserved, according to sort criteria only on the parent docs)? The latter would be great: users need only concern themselves with how parent docs sort. And I think the nested case would "just work" since the entire block (and sub-blocks) is preserved in the order it was originally indexed.
+1 to first improve testing of this. Scary the combination of static sort and doc blocks is untested. Maybe we already broke this in 9.x / main! |
yeah so this would be step no. 1 followed by all kinds of defaults we can derive from that. Like sorting only parents, auto providing bitsets for the queries etc. We would need to think about what we would do if someone deletes a parent doc, would we also delete all children? that's a tricky question how far we wanna go here |
I think adding parent doc values field to the IWC would make block indexing more robust than is today. Other tools would then also know that a Lucene index contains blocks.
+1 to keep full bwc. Although I think if this feature were to be build today, then I think failing (incase no parent field is configured) would be the best option here. |
@mikemccand @msokolov @jpountz @martijnvg see #12829 for reference and further discussions |
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to #12711
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to apache#12711
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to apache#12711
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to #12711
Today you can use the
add/UpdateDocuments
API even if a index sort is configured. This leads to broken indices if users rely on the guarantees of this API that document IDs are consecutive. This change prevents API calls that create blocks if a sort is configured. Note: this is a breaking change for all users that used this API only for convenience purposes to prevent iterating over documents.